Note: This is a sample solution for the project. Projects will NOT be graded on the basis of how well the submission matches this sample solution. Projects will be graded on the basis of the rubric only.¶
Problem Statement¶
Business Context¶
Understanding customer personality and behavior is pivotal for businesses to enhance customer satisfaction and increase revenue. Segmentation based on a customer's personality, demographics, and purchasing behavior allows companies to create tailored marketing campaigns, improve customer retention, and optimize product offerings.
A leading retail company with a rapidly growing customer base seeks to gain deeper insights into their customers' profiles. The company recognizes that understanding customer personalities, lifestyles, and purchasing habits can unlock significant opportunities for personalizing marketing strategies and creating loyalty programs. These insights can help address critical business challenges, such as improving the effectiveness of marketing campaigns, identifying high-value customer groups, and fostering long-term relationships with customers.
With the competition intensifying in the retail space, moving away from generic strategies to more targeted and personalized approaches is essential for sustaining a competitive edge.
Objective¶
In an effort to optimize marketing efficiency and enhance customer experience, the company has embarked on a mission to identify distinct customer segments. By understanding the characteristics, preferences, and behaviors of each group, the company aims to:
- Develop personalized marketing campaigns to increase conversion rates.
- Create effective retention strategies for high-value customers.
- Optimize resource allocation, such as inventory management, pricing strategies, and store layouts.
As a data scientist tasked with this project, your responsibility is to analyze the given customer data, apply machine learning techniques to segment the customer base, and provide actionable insights into the characteristics of each segment.
Data Dictionary¶
The dataset includes historical data on customer demographics, personality traits, and purchasing behaviors. Key attributes are:
Customer Information
- ID: Unique identifier for each customer.
- Year_Birth: Customer's year of birth.
- Education: Education level of the customer.
- Marital_Status: Marital status of the customer.
- Income: Yearly household income (in dollars).
- Kidhome: Number of children in the household.
- Teenhome: Number of teenagers in the household.
- Dt_Customer: Date when the customer enrolled with the company.
- Recency: Number of days since the customer’s last purchase.
- Complain: Whether the customer complained in the last 2 years (1 for yes, 0 for no).
Spending Information (Last 2 Years)
- MntWines: Amount spent on wine.
- MntFruits: Amount spent on fruits.
- MntMeatProducts: Amount spent on meat.
- MntFishProducts: Amount spent on fish.
- MntSweetProducts: Amount spent on sweets.
- MntGoldProds: Amount spent on gold products.
Purchase and Campaign Interaction
- NumDealsPurchases: Number of purchases made using a discount.
- AcceptedCmp1: Response to the 1st campaign (1 for yes, 0 for no).
- AcceptedCmp2: Response to the 2nd campaign (1 for yes, 0 for no).
- AcceptedCmp3: Response to the 3rd campaign (1 for yes, 0 for no).
- AcceptedCmp4: Response to the 4th campaign (1 for yes, 0 for no).
- AcceptedCmp5: Response to the 5th campaign (1 for yes, 0 for no).
- Response: Response to the last campaign (1 for yes, 0 for no).
Shopping Behavior
- NumWebPurchases: Number of purchases made through the company’s website.
- NumCatalogPurchases: Number of purchases made using catalogs.
- NumStorePurchases: Number of purchases made directly in stores.
- NumWebVisitsMonth: Number of visits to the company’s website in the last month.
Let's start coding!¶
Importing necessary libraries¶
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
Loading the data¶
# Mounting Google Drive in Google Colab
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# loading data into a pandas dataframe
data = pd.read_csv("/content/drive/MyDrive/marketing_campaign.csv", sep="\t")
Data Overview¶
Question 1: What are the data types of all the columns?¶
data.info()
data.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2216 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null object 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Z_CostContact 2240 non-null int64 27 Z_Revenue 2240 non-null int64 28 Response 2240 non-null int64 dtypes: float64(1), int64(25), object(3) memory usage: 507.6+ KB
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
Observations:
- The data has 29 columns in total, of which only 3 are of object type.
- There are 2240 entries (rows).
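`Dt_Customer` is stored as an object, so it is worth parsing it to a proper datetime before any tenure-based analysis. A minimal sketch on a toy frame standing in for `data` (the day-first format is taken from the `head()` output above; `Tenure_Days` is a hypothetical derived column, not part of the original dataset):

```python
import pandas as pd

# Toy stand-in for `data`; Dt_Customer uses the day-first format
# seen in the head() output (e.g. "04-09-2012").
df = pd.DataFrame({"Dt_Customer": ["04-09-2012", "08-03-2014", "21-08-2013"]})

# Parse the enrollment date explicitly to avoid month/day ambiguity
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")

# Hypothetical feature: customer tenure in days, relative to the
# most recent enrollment present in the data
df["Tenure_Days"] = (df["Dt_Customer"].max() - df["Dt_Customer"]).dt.days
print(df["Tenure_Days"].tolist())
```

With the real data, the same two lines applied to `data["Dt_Customer"]` would make the enrollment date usable for feature engineering.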
Question 2: Check the statistical summary of the data. What is the average household income?¶
# the statistical summary of the data
data.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 2240.0 | NaN | NaN | NaN | 5592.159821 | 3246.662198 | 0.0 | 2828.25 | 5458.5 | 8427.75 | 11191.0 |
| Year_Birth | 2240.0 | NaN | NaN | NaN | 1968.805804 | 11.984069 | 1893.0 | 1959.0 | 1970.0 | 1977.0 | 1996.0 |
| Education | 2240 | 5 | Graduation | 1127 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Marital_Status | 2240 | 8 | Married | 864 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Income | 2216.0 | NaN | NaN | NaN | 52247.251354 | 25173.076661 | 1730.0 | 35303.0 | 51381.5 | 68522.0 | 666666.0 |
| Kidhome | 2240.0 | NaN | NaN | NaN | 0.444196 | 0.538398 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| Teenhome | 2240.0 | NaN | NaN | NaN | 0.50625 | 0.544538 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| Dt_Customer | 2240 | 663 | 31-08-2012 | 12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Recency | 2240.0 | NaN | NaN | NaN | 49.109375 | 28.962453 | 0.0 | 24.0 | 49.0 | 74.0 | 99.0 |
| MntWines | 2240.0 | NaN | NaN | NaN | 303.935714 | 336.597393 | 0.0 | 23.75 | 173.5 | 504.25 | 1493.0 |
| MntFruits | 2240.0 | NaN | NaN | NaN | 26.302232 | 39.773434 | 0.0 | 1.0 | 8.0 | 33.0 | 199.0 |
| MntMeatProducts | 2240.0 | NaN | NaN | NaN | 166.95 | 225.715373 | 0.0 | 16.0 | 67.0 | 232.0 | 1725.0 |
| MntFishProducts | 2240.0 | NaN | NaN | NaN | 37.525446 | 54.628979 | 0.0 | 3.0 | 12.0 | 50.0 | 259.0 |
| MntSweetProducts | 2240.0 | NaN | NaN | NaN | 27.062946 | 41.280498 | 0.0 | 1.0 | 8.0 | 33.0 | 263.0 |
| MntGoldProds | 2240.0 | NaN | NaN | NaN | 44.021875 | 52.167439 | 0.0 | 9.0 | 24.0 | 56.0 | 362.0 |
| NumDealsPurchases | 2240.0 | NaN | NaN | NaN | 2.325 | 1.932238 | 0.0 | 1.0 | 2.0 | 3.0 | 15.0 |
| NumWebPurchases | 2240.0 | NaN | NaN | NaN | 4.084821 | 2.778714 | 0.0 | 2.0 | 4.0 | 6.0 | 27.0 |
| NumCatalogPurchases | 2240.0 | NaN | NaN | NaN | 2.662054 | 2.923101 | 0.0 | 0.0 | 2.0 | 4.0 | 28.0 |
| NumStorePurchases | 2240.0 | NaN | NaN | NaN | 5.790179 | 3.250958 | 0.0 | 3.0 | 5.0 | 8.0 | 13.0 |
| NumWebVisitsMonth | 2240.0 | NaN | NaN | NaN | 5.316518 | 2.426645 | 0.0 | 3.0 | 6.0 | 7.0 | 20.0 |
| AcceptedCmp3 | 2240.0 | NaN | NaN | NaN | 0.072768 | 0.259813 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| AcceptedCmp4 | 2240.0 | NaN | NaN | NaN | 0.074554 | 0.262728 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| AcceptedCmp5 | 2240.0 | NaN | NaN | NaN | 0.072768 | 0.259813 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| AcceptedCmp1 | 2240.0 | NaN | NaN | NaN | 0.064286 | 0.245316 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| AcceptedCmp2 | 2240.0 | NaN | NaN | NaN | 0.013393 | 0.114976 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Complain | 2240.0 | NaN | NaN | NaN | 0.009375 | 0.096391 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Z_CostContact | 2240.0 | NaN | NaN | NaN | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| Z_Revenue | 2240.0 | NaN | NaN | NaN | 11.0 | 0.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 |
| Response | 2240.0 | NaN | NaN | NaN | 0.149107 | 0.356274 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Observations:¶
- The average household income is about 52,247.3 dollars.
Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method¶
# Check for missing values
print(data.isnull().sum())
# Treat missing values in 'Income' column using mean imputation
data['Income'] = data['Income'].fillna(data['Income'].mean())
# Verify if missing values are handled
print(data.isnull().sum())
ID 0 Year_Birth 0 Education 0 Marital_Status 0 Income 24 Kidhome 0 Teenhome 0 Dt_Customer 0 Recency 0 MntWines 0 MntFruits 0 MntMeatProducts 0 MntFishProducts 0 MntSweetProducts 0 MntGoldProds 0 NumDealsPurchases 0 NumWebPurchases 0 NumCatalogPurchases 0 NumStorePurchases 0 NumWebVisitsMonth 0 AcceptedCmp3 0 AcceptedCmp4 0 AcceptedCmp5 0 AcceptedCmp1 0 AcceptedCmp2 0 Complain 0 Z_CostContact 0 Z_Revenue 0 Response 0 dtype: int64 ID 0 Year_Birth 0 Education 0 Marital_Status 0 Income 0 Kidhome 0 Teenhome 0 Dt_Customer 0 Recency 0 MntWines 0 MntFruits 0 MntMeatProducts 0 MntFishProducts 0 MntSweetProducts 0 MntGoldProds 0 NumDealsPurchases 0 NumWebPurchases 0 NumCatalogPurchases 0 NumStorePurchases 0 NumWebVisitsMonth 0 AcceptedCmp3 0 AcceptedCmp4 0 AcceptedCmp5 0 AcceptedCmp1 0 AcceptedCmp2 0 Complain 0 Z_CostContact 0 Z_Revenue 0 Response 0 dtype: int64
Observations:¶
- We have imputed the missing values of Income with the column mean.
- Since we have not formed clusters yet, we can only impute the overall mean for all missing values; after clustering, a per-cluster mean would be a finer choice.
- The 24 missing values in Income are now filled with the mean of the column.
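Since the statistical summary shows Income has a long right tail (a maximum of 666666 against a median near 51K), median imputation is a robust alternative worth considering. A minimal sketch on a toy Income column (the values below are illustrative, not from the dataset):

```python
import pandas as pd
import numpy as np

# Toy Income column with one missing value and one extreme outlier,
# mimicking the skew seen in the real data (max of 666666)
income = pd.Series([35000.0, 51000.0, 68000.0, np.nan, 666666.0])

# Mean imputation is pulled upward by the outlier ...
mean_filled = income.fillna(income.mean())
# ... while median imputation is robust to it
median_filled = income.fillna(income.median())

print(mean_filled[3], median_filled[3])
```

With the real data, `data['Income'].fillna(data['Income'].median())` would be the drop-in equivalent of the mean imputation used above.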
Question 4: Are there any duplicates in the data?¶
# Check for duplicates
duplicates = data.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")
Number of duplicate rows: 0
Observations:¶
- There are no duplicated rows in the data.
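Beyond duplicates, the summary table shows that `Z_CostContact` and `Z_Revenue` are constant (standard deviation 0.0), and `ID` is a pure identifier; none of these carry clustering signal. A minimal sketch of dropping them, on a toy frame standing in for `data`:

```python
import pandas as pd

# Toy frame: ID is an identifier, Z_CostContact is constant,
# as in the real data where Z_CostContact and Z_Revenue never vary
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Z_CostContact": [3, 3, 3],
    "Income": [35000, 51000, 68000],
})

# Columns with a single unique value carry no clustering signal
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=["ID"] + constant_cols)
print(list(df.columns))
```

Applying the same logic to the full dataset before scaling would keep the distance computations focused on informative features.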
Exploratory Data Analysis¶
Univariate Analysis¶
Question 5: Explore all the variables and provide observations on their distributions. (histograms and boxplots)¶
# Loop through each numerical column in the DataFrame
for col in data.select_dtypes(include=np.number):
plt.figure(figsize=(12, 4)) # Adjust figure size as needed
# Histogram
plt.subplot(1, 2, 1)
sns.histplot(data[col], kde=True) # Include KDE for better visualization
plt.title(f'Histogram of {col}')
# Boxplot
plt.subplot(1, 2, 2)
sns.boxplot(y=data[col])
plt.title(f'Boxplot of {col}')
plt.show()
Observations:¶
- All customers were born before the year 2000, and a few birth years fall in the late 19th century (minimum 1893); these are most likely data entry errors.
- The Income distribution is fairly symmetric, suggesting it is close to a normal distribution, with a few outliers corresponding to customers with very high incomes.
- Customers have at most 2 children and 2 teenagers at home, and the majority have none.
- The distribution of Recency is close to uniform.
- The distributions of the amounts spent on food categories (wine, fish, sweets, meat, and fruits) and on gold are similar: a large block of customers spending little or nothing, followed by a slow decay out to the outliers (right-skewed).
- 50% of customers made 2 or fewer purchases using a discount.
- The distributions of the number of purchases made through the company's website and through catalogs are right-skewed, with outliers around 25.
- 13 is the maximum number of purchases made directly in store.
- The distribution of the number of monthly visits to the company's website is slightly left-skewed, and 50% of customers visited the site 6 or fewer times per month.
- Except for the second campaign, the number of customers responding to each campaign is similar; the second campaign received noticeably fewer responses.
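The skewness and outliers read off the histograms and boxplots can also be quantified numerically. A minimal sketch on a toy right-skewed spend column (the values are illustrative; with the real data, the same code would be run per `Mnt*` column):

```python
import pandas as pd

# Toy right-skewed spend column; the real Mnt* columns behave similarly:
# a large mass near zero with a long right tail
spend = pd.Series([0, 1, 5, 8, 20, 40, 120, 900])

# Sample skewness > 1 confirms a strong right tail
skew = spend.skew()

# IQR rule to count outliers, matching what the boxplot whiskers flag
q1, q3 = spend.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = spend[(spend < q1 - 1.5 * iqr) | (spend > q3 + 1.5 * iqr)]

print(skew, len(outliers))
```

Looping this over `data.select_dtypes(include=np.number)` would turn the visual impressions above into a ranked table of skewness and outlier counts.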
# Univariate analysis for categorical features
for col in data.select_dtypes(include=['object']):
plt.figure(figsize=(8, 6))
sns.countplot(x=col, data=data)
plt.title(f'Countplot of {col}')
plt.xticks(rotation=90) # Rotate x-axis labels for better readability
plt.show()
print("-" * 30)
------------------------------
------------------------------
------------------------------
Observations:
- The majority of customers have a Graduation-level education and are married.
- There is no need to look at the customer ID, since it carries no useful information.
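The summary table showed 8 distinct `Marital_Status` levels, and the countplot suggests some are very rare; collapsing rare levels into an "Other" bucket keeps later profiling readable. A minimal sketch on a toy column (the labels "Absurd" and "YOLO" here are illustrative rare categories, and the threshold of 3 is arbitrary):

```python
import pandas as pd

# Toy Marital_Status column; the real one has 8 levels, some very rare
status = pd.Series(["Married"] * 5 + ["Single"] * 4 + ["Together"] * 3
                   + ["Absurd", "YOLO"])  # hypothetical rare labels

# Collapse any level seen fewer than 3 times into "Other"
counts = status.value_counts()
rare = counts[counts < 3].index
cleaned = status.where(~status.isin(rare), "Other")
print(cleaned.value_counts().to_dict())
```

With the real data, the same pattern on `data['Marital_Status']` (with a threshold chosen from its `value_counts()`) would merge the tiny categories before any segment-level comparison.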
Bivariate Analysis¶
Question 6: Perform multivariate analysis to explore the relationships between the variables.¶
# Create a heatmap of the correlation matrix
plt.figure(figsize=(30, 20))
num_data=data.select_dtypes(include=np.number)
sns.heatmap(num_data.corr(), annot=True, cmap='Spectral')
plt.title('Correlation Matrix Heatmap')
plt.show()
- Year of birth is positively correlated with the number of kids at home and negatively correlated with the number of teenagers, which is intuitive: younger customers tend to have small children, while older customers are more likely to have teenagers (or children who have already left home).
- A surprising fact is that the richer the customer, the less likely they are to have a kid at home.
- Income is positively correlated with the variables representing the amounts spent on food and gold.
- An interesting fact is that while Kidhome is negatively correlated with all the food and gold spending variables, Teenhome is essentially uncorrelated with the amounts spent on wine and gold.
- NumDealsPurchases is positively correlated with Teenhome, suggesting that the more teenagers customers have at home, the more they purchase with discounts. It is also positively correlated with the number of visits to the company's website.
- Analyzing every pairwise correlation would take too long, so we focus on the most important ones.
- The campaign response variables are correlated with each other: a customer who responds to one campaign is likely to respond to the others.
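Rather than reading a large heatmap cell by cell, the strongest pairwise correlations can be ranked programmatically. A minimal sketch on a toy numeric frame (with the real data, `data.select_dtypes(include=np.number)` would be passed instead):

```python
import pandas as pd
import numpy as np

# Toy numeric frame mimicking the relationships described above:
# Income rises with wine spending and falls with kids at home
df = pd.DataFrame({
    "Income":   [20, 40, 60, 80, 100],
    "MntWines": [5, 15, 30, 55, 90],
    "Kidhome":  [2, 2, 1, 1, 0],
})

# Keep only the upper triangle of the correlation matrix, flatten it,
# and rank pairs by absolute correlation
corr = df.corr()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(key=abs, ascending=False)
print(pairs)
```

On the full dataset this surfaces the top correlated pairs directly, instead of scanning a 26x26 heatmap by eye.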
# Bivariate analysis for 'Total Spent' (create a new 'TotalSpent' column)
data['TotalSpent'] = data['MntWines'] + data['MntFruits'] + data['MntMeatProducts'] + data['MntFishProducts'] + data['MntSweetProducts'] + data['MntGoldProds']
# Plot TotalSpent against other relevant variables
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Income', y='TotalSpent', data=data)
plt.title('Total Spent vs. Income')
plt.show()
plt.figure(figsize=(10, 6))
sns.boxplot(x='Education', y='TotalSpent', data=data)
plt.title('Total Spent vs. Education')
plt.show()
plt.figure(figsize=(10, 6))
sns.boxplot(x='Education', y='Income', data=data)
plt.title('Income vs. Education')
plt.show()
plt.figure(figsize=(10,6))
sns.boxplot(x='Marital_Status', y='TotalSpent', data=data)
plt.title('Total Spent vs. Marital Status')
plt.show()
# Pairplot for selected numerical features
selected_features = ['Income', 'Recency', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'TotalSpent']
sns.pairplot(data[selected_features])
plt.show()
- Among all the variables in the data, Income and total spending stand out as the most important for the company.
- As seen before, Income and total spending are positively correlated.
- The level of education is a strong determinant of spending: customers with a Graduation-level education or higher are the ones who spend more, possibly because of their higher incomes.
- The Income vs. Education plot confirms that this more educated category of customers earns more than customers with Basic education.
- Customers who live alone spend less money than the others.
- There appears to be no correlation between the number of visits to the company's website and the number of purchases made on it.
K-means Clustering¶
Question 7: Select the appropriate number of clusters using the elbow plot. What do you think is the appropriate number of clusters?¶
# Scale the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(num_data)
# Determine the optimal number of clusters using the Elbow method
score = []
for i in range(1, 14):
kmeans = KMeans(n_clusters=i)
kmeans.fit(scaled_data)
score.append(kmeans.inertia_)
plt.plot(range(1, 14), score)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('score')
plt.show()
Observations:¶
- The elbow plot suggests 2 clusters. Let's check with the silhouette method whether this number of clusters is appropriate.
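The elbow can also be located numerically by looking at the relative drop in inertia from one k to the next: the elbow is where that drop collapses. A minimal sketch on synthetic two-blob data standing in for `scaled_data` (the blob centers are arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for scaled_data: two well-separated blobs
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [6, 6]],
                  cluster_std=0.7, random_state=42)

# Inertia for k = 1..6; the elbow is where the relative drop collapses
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 7)]
drops = [1 - b / a for a, b in zip(inertias, inertias[1:])]
print([round(d, 2) for d in drops])
```

On this toy data the first drop (k=1 to k=2) dominates all later ones, which is exactly the pattern the elbow plot above shows for the real data at k=2.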
Question 8: Finalize the appropriate number of clusters by checking the silhouette score as well. Is the answer different from the elbow plot?¶
# Silhouette Analysis
visualizer = SilhouetteVisualizer(KMeans(2))
visualizer.fit(scaled_data)
visualizer.poof()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 2240 Samples in 2 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
visualizer = SilhouetteVisualizer(KMeans(3))
visualizer.fit(scaled_data)
visualizer.poof()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 2240 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
Observations:¶
- The silhouette method seems to confirm that the optimal number of clusters is 2, since it has the highest average score (about 0.25).
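The visual comparison of silhouette plots can be complemented by computing the average silhouette score directly for each candidate k and taking the maximizer. A minimal sketch on synthetic two-blob data standing in for `scaled_data`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled_data; with the real data, pass scaled_data
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [8, 8]],
                  cluster_std=0.8, random_state=0)

# Average silhouette score for each candidate k; the best k maximizes it
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

Running the same loop over `scaled_data` reproduces the conclusion above (k=2) without eyeballing the plots.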
Question 9: Do a final fit with the appropriate number of clusters. How much total time does it take for the model to fit the data?¶
import time
start_time = time.time()
kmeans = KMeans(n_clusters=2, random_state=42) # Use the appropriate number of clusters
kmeans.fit(scaled_data)
end_time = time.time()
time_2clusters_km = end_time - start_time
print(f"Total fitting time: {time_2clusters_km:.4f} seconds")
Total fitting time: 0.0254 seconds
Observations:¶
- The K-Means algorithm took 0.0254 seconds to fit the model. We will compare this time later with the hierarchical algorithm.
Hierarchical Clustering¶
Question 10: Calculate the cophenetic correlation for every combination of distance metrics and linkage methods. Which combination has the highest cophenetic correlation?¶
# Calculate the cophenetic correlation for every combination of distance metrics and linkage methods
distance_metrics = ['euclidean', 'minkowski', 'chebyshev']
linkage_methods = ['complete', 'average', 'single']
results = []
for metric in distance_metrics:
for method in linkage_methods:
Z = linkage(scaled_data, method=method, metric=metric)
c, coph_dists = cophenet(Z, pdist(scaled_data, metric=metric))
results.append([metric, method, c])
# Convert the results to a DataFrame for easier view and analysis
cophenetic_df = pd.DataFrame(results, columns=['Distance Metric', 'Linkage Method', 'Cophenetic Correlation'])
# Find the combination with the highest cophenetic correlation
highest_correlation = cophenetic_df['Cophenetic Correlation'].max()
best_combination = cophenetic_df[cophenetic_df['Cophenetic Correlation'] == highest_correlation]
print(f"Highest Cophenetic Correlation: {highest_correlation:.4f}")
print(f"Best combination:\n{best_combination}")
# Display the cophenetic correlations for all combinations
print("\nCophenetic Correlations:")
cophenetic_df
Highest Cophenetic Correlation: 0.9590 Best combination: Distance Metric Linkage Method Cophenetic Correlation 7 chebyshev average 0.95896 Cophenetic Correlations:
| Distance Metric | Linkage Method | Cophenetic Correlation | |
|---|---|---|---|
| 0 | euclidean | complete | 0.595966 |
| 1 | euclidean | average | 0.896379 |
| 2 | euclidean | single | 0.843518 |
| 3 | minkowski | complete | 0.595966 |
| 4 | minkowski | average | 0.896379 |
| 5 | minkowski | single | 0.843518 |
| 6 | chebyshev | complete | 0.855965 |
| 7 | chebyshev | average | 0.958960 |
| 8 | chebyshev | single | 0.897534 |
Observations:¶
- The best combination of distance metric and linkage method is Chebyshev with average linkage.
- The cophenetic correlation for this combination is about 0.959.
Question 11: Plot the dendrogram for every linkage method with Euclidean distance only. What should be the appropriate linkage according to the plots?¶
# Hierarchical clustering dendrogram for each linkage method
plt.figure(figsize=(10, 5))
plt.title("Dendrogram - complete linkage")
linked = linkage(scaled_data, 'complete', metric='euclidean')
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=True)
plt.show()
plt.figure(figsize=(10, 5))
plt.title("Dendrogram - average linkage")
linked = linkage(scaled_data, 'average', metric='euclidean')
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=True)
plt.show()
plt.figure(figsize=(10, 5))
plt.title("Dendrogram - single linkage")
linked = linkage(scaled_data, 'single', metric='euclidean')
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=True)
plt.show()
plt.figure(figsize=(10, 5))
plt.title("Dendrogram - ward linkage")
linked = linkage(scaled_data, 'ward', metric='euclidean')
dendrogram(linked,
           orientation='top',
           distance_sort='descending',
           show_leaf_counts=True)
plt.show()
Observations:¶
- Among all the linkage methods, ward is the one that produces the clearest, most balanced hierarchy.
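A dendrogram can be turned into flat cluster labels by cutting the tree with `scipy.cluster.hierarchy.fcluster`. A minimal sketch on a toy 2-D dataset with two obvious groups (standing in for `scaled_data`):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data with two obvious groups, standing in for scaled_data
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Ward linkage, then cut the dendrogram into a fixed number of clusters
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Cutting the ward linkage of `scaled_data` at `t=2` this way gives the same partition as the `AgglomerativeClustering` fit used in the next questions.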
Question 12: Check the silhouette score for hierarchical clustering. What should be the appropriate number of clusters according to this plot?¶
# Calculate Silhouette Score for Hierarchical Clustering
range_n_clusters = range(2,14)
silhouette_scores = []
for n_clusters in range_n_clusters:
hierarchical_cluster = AgglomerativeClustering(n_clusters=n_clusters, metric='euclidean', linkage='ward')
cluster_labels = hierarchical_cluster.fit_predict(scaled_data)
silhouette_avg = silhouette_score(scaled_data, cluster_labels)
silhouette_scores.append(silhouette_avg)
plt.plot(range_n_clusters, silhouette_scores)
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score vs. Number of Clusters")
plt.show()
best_n_clusters_hierarchical = range_n_clusters[np.argmax(silhouette_scores)]
print(f"Best number of clusters (Hierarchical) based on silhouette score: {best_n_clusters_hierarchical}")
Best number of clusters (Hierarchical) based on silhouette score: 2
print(silhouette_scores)
[0.21123833296375513, 0.19192664391999967, 0.20176352122354696, 0.21070970107328682, 0.20995015294045005, 0.18464592084365147, 0.09352798652840015]
Observations:¶
- According to the plot, the number of clusters with the maximum silhouette score is 2.
- This agrees with what we found previously in the K-Means analysis.
Question 13: Fit the hierarchical clustering model with the appropriate parameters finalized above. How much time does it take to fit the model?¶
import time
start_time = time.time()
hierarchical_cluster = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='ward')
cluster_labels = hierarchical_cluster.fit_predict(scaled_data)
end_time = time.time()
time_2clusters_hr = end_time - start_time
print(f"Hierarchical clustering fitting time: {time_2clusters_hr:.4f} seconds")
Hierarchical clustering fitting time: 0.3568 seconds
Observations:¶
- The hierarchical model took about 0.36 seconds to fit, roughly 14 times longer than the 0.0254 seconds K-Means needed.
Cluster Profiling and Comparison¶
K-Means Clustering vs Hierarchical Clustering Comparison¶
Question 14: Perform and compare cluster profiling on both algorithms using boxplots. Based on all the observations, which one of them provides better clustering?¶
# K-Means Clustering
kmeans = KMeans(n_clusters=2)
kmeans.fit(scaled_data)
kmeans_labels = kmeans.labels_
# Hierarchical Clustering
hierarchical_cluster = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='ward')
hierarchical_labels = hierarchical_cluster.fit_predict(scaled_data)
#add cluster labels to the original dataframe
data['KMeans_Cluster'] = kmeans_labels
data['Hierarchical_Cluster'] = hierarchical_labels
# Cluster Profiling using boxplots
numerical_cols = data.select_dtypes(include=np.number).columns
for col in numerical_cols:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.boxplot(data=data, x='KMeans_Cluster', y=col)
plt.title(f'KMeans Clustering - {col}')
plt.subplot(1, 2, 2)
sns.boxplot(data=data, x='Hierarchical_Cluster', y=col)
plt.title(f'Hierarchical Clustering - {col}')
plt.tight_layout() # Adjust layout to prevent overlapping titles
plt.show()
Observations:¶
- The K-Means algorithm separates the two clusters with fewer outliers than the hierarchical one.
- We can conclude that, in this case, K-Means clustering is both faster and better-performing than hierarchical clustering.
- For the next question we will therefore use the K-Means labels.
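Before comparing the two label sets visually, their agreement can be measured in one number with the adjusted Rand index (1.0 means identical partitions up to relabeling). A minimal sketch on synthetic two-group data standing in for `scaled_data`:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Toy two-group data standing in for scaled_data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# 1.0 means the two algorithms found identical partitions (up to relabeling)
ari = adjusted_rand_score(km_labels, hc_labels)
print(ari)
```

Running `adjusted_rand_score(data['KMeans_Cluster'], data['Hierarchical_Cluster'])` on the real labels would quantify how closely the two profilings above actually match.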
Question 15: Perform Cluster profiling on the data with the appropriate algorithm determined above using a barplot. What observations can be derived for each cluster from this plot?¶
# Cluster Profiling using barplots
for col in numerical_cols:
plt.figure(figsize=(10, 6))
data.groupby('KMeans_Cluster')[col].mean().plot(kind='bar')
plt.title(f'KMeans Clustering - Average {col} per Cluster')
plt.xlabel('Cluster')
plt.ylabel(f'Average {col}')
plt.show()
alpha=data.groupby('KMeans_Cluster').Income.mean()
alpha
| Income | |
|---|---|
| KMeans_Cluster | |
| 0 | 72217.750296 |
| 1 | 39300.960281 |
Observations:¶
- The K-Means clustering gives 2 clusters of similar size.
- The first cluster (cluster 0) has a higher average income than the second (about 72K vs. 39K dollars).
- Most customers in the higher-income cluster have no kids at home; the fact that the average is not exactly 0 is due to the few customers in this cluster who do have kids.
- In terms of purchasing habits for food and gold, the first cluster spends more on average than the second.
- The first cluster also spends more both in store and on the company's website, and tends to purchase via catalogs rather than with discounts.
- The second cluster responds less to the company's campaigns than the first, and complaints are more likely to come from the second cluster.
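All of the per-cluster averages read off the barplots can be collected into a single profiling table with one `groupby`. A minimal sketch on a toy labeled frame (the numbers are illustrative; with the real data, group `data` on `KMeans_Cluster` over all numeric columns):

```python
import pandas as pd

# Toy labeled frame; with the real data, group on 'KMeans_Cluster'
df = pd.DataFrame({
    "KMeans_Cluster": [0, 0, 1, 1],
    "Income":     [70000, 74000, 38000, 41000],
    "TotalSpent": [1200, 1400, 150, 220],
})

# One profiling table instead of many barplots: mean per cluster plus size
profile = df.groupby("KMeans_Cluster").mean()
profile["Count"] = df["KMeans_Cluster"].value_counts().sort_index()
print(profile)
```

A table like this makes the high-income / low-income contrast between the two clusters readable at a glance, with the cluster sizes alongside.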
Business Recommendations¶
- We have seen that 2 clusters are distinctly formed using both methodologies, and the clusters produced by the two algorithms are analogous to each other.
- Cluster 0 contains high-income, high-spending customers who buy across channels (store, web, and catalog) and respond well to campaigns.
- Cluster 1 contains lower-income customers who spend less, rely more on discounts, and respond less to the company's campaigns.
Here are actionable business recommendations based on the cluster profiling:
1. Focus on Retaining High-Value Customers¶
- Offer Exclusive Loyalty Programs: Provide tailored loyalty benefits, early access to products, and exclusive discounts to maintain engagement and drive repeat purchases.
- Upsell and Cross-Sell: Introduce premium products or bundles targeting their high spending patterns across product categories like wines, gold products, and meats.
- Personalized Campaigns: Use their high response rate to create personalized campaigns highlighting products they prefer.
2. Activate Potential in Moderate-Spending Customers¶
- Incentivize Higher Engagement: Offer targeted discounts or special offers to encourage increased spending and purchases across channels.
- Educate About Products: Provide content (emails, guides, or social media) showcasing the value and uniqueness of products they don’t purchase frequently.
- Improve Campaign Effectiveness: Refine campaign messaging based on their moderate response rate to increase acceptance.
3. Reengage Low-Value Customers¶
- Win-Back Campaigns: Implement campaigns specifically aimed at bringing back inactive customers, such as offering steep discounts or limited-time offers.
- Understand Barriers to Engagement: Conduct surveys or collect feedback to identify reasons for their low purchases and disengagement.
- Promote Entry-Level Products: Introduce affordable or trial-sized products to ease them into higher spending.
4. Convert Browsers into Buyers¶
- Optimize Website Experience: For customers with frequent website visits but low spending, improve website navigation, showcase popular products, and streamline the checkout process.
- Targeted Digital Campaigns: Retarget these users with ads or emails featuring products they browsed but didn’t purchase.
- Offer Online-Exclusive Discounts: Provide web-only discounts or promotions to convert visits into purchases.
5. Strengthen Digital and Multi-Channel Strategies¶
- Seamless Omni-Channel Experience: Ensure a consistent shopping experience across all channels (web, catalog, and store) to encourage cross-channel engagement, especially for the higher-value segments.
- Digital Campaigns for All Segments: Focus on targeted digital campaigns, particularly on customers with moderate to high online engagement.
6. Develop Campaigns to Boost Responses¶
- Use the insights from the segments with higher response rates to refine campaign targeting and messaging, and emulate the strategies that worked for high-value customers to increase responses across other segments.
7. Leverage Product-Specific Insights¶
- Promote popular categories (e.g., wines, gold products) to high-value clusters, while running introductory campaigns for less-engaged clusters to familiarize them with premium products.
By focusing on these strategies, the company can enhance engagement, increase revenue, and strengthen customer loyalty across all clusters.